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gap by currently available datasets. At the time of writing, our datasets contain 10 full 
HD (1920 x 1080) video clips with annotated JSON files, totaling 100 minutes of footage 
and 13 GB in size. We believe this dataset will be useful as training and benchmark 
data for a variety of research topics regarding human facial and emotion recognition. 
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1. INTRODUCTION 

Nowadays, mobile devices such as smartphones are equipped with high-quality cameras and capable 
processors, which enable people to record high-definition (HD) video. People who want to record good-quality 
video no longer need to buy an expensive digital or DSLR camera. According to an observation by Ofcom [1] 
in 2016, people in the USA spent approximately 87 hours a month on average browsing on a smartphone, 
compared to 34 hours on a laptop or desktop. This suggests that people tend to use a smartphone for most of 
their activities and that the features of their smartphones are more than enough for their daily needs, including 
video recording. Statista [2] also reported that 71% of 44,761 respondents use their smartphone to take photos 
or videos, the second most common smartphone activity after accessing the internet. This indicates that 
smartphone cameras are now considered satisfactory for taking images and recording video, compared to a 
few years ago. 

Despite the increasing amount of time people spend on their smartphones, there is no publicly available 
dataset of Asian facial features captured using a smartphone camera. Existing datasets generally consist of 
still images, and most video datasets cover only Western facial features. As the literature suggests, although 
facial expression recognition is universal across all human races, the perception of emotions from facial 
expression cues differs considerably from one culture to another. Moreover, most datasets focus only on the 
facial features in the video, and none captures the facial features of a person talking with others in the wild. 
By using a smartphone, we can capture a natural conversation between two interlocutors in the wild. Such 
datasets allow computers to learn emotion recognition as well as facial expression recognition and 
classification. 

In this work, we aim to address this gap by presenting a mobile video dataset that contains videos of 
Asian people having a natural conversation with each other. Our dataset was obtained by recording a natural 
conversation between two people inside a controlled room with adequate lighting, with the mobile camera in 
a steady position. To collect a variety of facial features, we provided several topics for the interlocutors to 
choose from. The topics are mostly general ones such as foods, lecturers, etc. 

Journal Homepage: http://iaescore.com/journals/index.php/IJECE 
IJECE ISSN: 2088-8708 

The rest of this paper is organized as follows. In Section 2, we list the existing publicly available 
datasets and explain how they differ from our dataset. Our method of collecting the data and the characteristics 
of the data are explained in Section 3. Potential applications of the dataset are described in Section 4. Lastly, 
the conclusion is provided in Section 5. 


2. RELATED WORKS 

Currently, several datasets have been created for many kinds of recognition tasks, especially facial 
expression analysis and recognition [3, 4, 5]. However, only a few datasets contain Asian people and were 
recorded using a smartphone. The extended Cohn-Kanade database, CK+ [6], is composed of 593 recordings 
of posed and non-posed sequences. It was recorded under controlled conditions of light and head motion, and 
the sequences range between 9 and 60 frames each. Each sequence represents a single changing facial 
expression that starts with a neutral expression and ends with a peak expression. The transitions between 
expressions are not included. Moreover, the NRC-IIT database [7] contains pairs of short, low-resolution 
MPEG-1-encoded video clips. Each video clip shows the face of a user sitting in front of a monitor and 
exhibiting a wide range of facial expressions and orientations, as captured by a USB webcam mounted on the 
computer. Every video clip is about 15 seconds long, has a capture rate of 20 fps, and is compressed with the 
AVI Intel codec at a bit rate of 481 Kbps. 

On the other hand, the Cohn-Kanade DFAT-504 dataset [8] consists of 100 university students ranging 
in age from 18 to 30 years. 65% were female, 15% were African-American, and 3% were Asian or Latino. 
Students were instructed by an experimenter to perform a series of 23 facial expressions. Students began and 
ended each display with a neutral face. Image sequences from neutral to target display were digitized into 
640 by 480 pixel arrays with 8-bit precision for grayscale values. Similarly, the MMI database [9] contains a 
large collection of FACS-coded facial videos. It consists of 1395 manually AU-coded video sequences, the 
majority of which are posed and recorded in laboratory settings. 

Most of the datasets mentioned above focus only on the image part of the video without recording the 
audio, which makes them quite unnatural for several expressions. They were mostly recorded with a camera or 
webcam. In addition, the duration of each recording is considered too short for real-life applications, which 
combine different aspects and the context of the topic with the facial expressions in the video. In summary, 
our dataset differs in the following points: (i) it is recorded using the camera of a mobile device; (ii) it 
contains long natural conversation videos; (iii) it includes full HD videos; and (iv) it includes audio for 
matching expressions with context. 


3. PROPOSED METHOD 

In this paper, we propose a new mobile video dataset that can be used as benchmark data for several 
recognition tasks as well as training data for machine learning tasks. Here, we start by explaining how we 
collected the dataset. The aim of this research is to provide a publicly available dataset of Asian (specifically 
Indonesian) facial features, expressions, and conversations. Thus, we recruited twenty volunteers, mainly 
Indonesian students (aged between 19 and 21), to participate in this research. The participants were given a 
list of possible topics to discuss during a 10-minute recording. In one recording session, two interlocutors 
sat facing each other across a table. The participants started the conversation when the researcher gave them 
the signal to begin. 

To record the conversation, the researchers set up two smartphones with identical camera 
specifications. The smartphones used in this research were two Xiaomi Mi 4i devices with 13 MP, f/2.0 
cameras, and the video was recorded in a full HD setting with a resolution of 1920x1080 pixels at 30 fps. The 
smartphones were placed in a steady position in front of each interlocutor. An example of the recorded video 
is depicted in Figure 1, where Figure 1a shows the first interlocutor in a conversation on a selected topic and 
Figure 1b shows the other interlocutor. 
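
As a rough sanity check on these recording parameters, the arithmetic can be sketched in Python as follows. 
The session count assumes that the twenty volunteers were split into pairs and that nobody was recorded twice, 
and the average bitrate assumes that the 13 GB total reported for the dataset is in decimal gigabytes and 
covers 100 minutes of footage. Neither assumption is stated explicitly in the paper, and the variable names 
are illustrative only. 

```python
# Recording parameters stated in the paper.
FPS = 30                      # frames per second
SESSION_MINUTES = 10          # length of one recorded conversation
WIDTH, HEIGHT = 1920, 1080    # full HD resolution
VOLUNTEERS = 20
PEOPLE_PER_SESSION = 2

# Dataset totals stated in the paper.
TOTAL_MINUTES = 100
TOTAL_SIZE_GB = 13            # assumption: decimal gigabytes (10**9 bytes)

# Assumption: each volunteer takes part in exactly one session.
sessions = VOLUNTEERS // PEOPLE_PER_SESSION
conversation_minutes = sessions * SESSION_MINUTES

# Frames and pixels in a single 10-minute clip.
frames_per_clip = FPS * SESSION_MINUTES * 60
pixels_per_frame = WIDTH * HEIGHT

# Average overall bitrate (video plus audio) implied by the reported totals.
avg_bitrate_mbps = TOTAL_SIZE_GB * 1e9 * 8 / (TOTAL_MINUTES * 60) / 1e6

print(f"sessions: {sessions}, conversation time: {conversation_minutes} min")
print(f"frames per clip: {frames_per_clip}")
print(f"pixels per frame: {pixels_per_frame}")
print(f"average bitrate: {avg_bitrate_mbps:.1f} Mbit/s")
```

Under these assumptions, ten sessions of ten minutes each account for the 100 minutes reported for the 
dataset, and each clip contains 18,000 frames. 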

During the conversation, the volunteers were encouraged to behave as if it were a normal, natural 
conversation in order to obtain the most natural dataset possible. After 10 minutes, they were reminded to stop 
the conversation, and the video was saved as an mp4 file in the MPEG-4 format on the device before being 
exported to a computer. Each file is named in the format "CONVERSATIONID_CAMERAID", where 
CONVERSATIONID is the identifier for the session and CAMERAID is the identifier for the device used. 
After the data were completely recorded, we annotated the videos for three different facial expressions, i.e., 
sad, neutral, and happy. Using the Visual Object Tagging Tool (VoTT) [10] provided by Microsoft, we obtain 
the annotation data in JSON format as shown in Figure 2. Furthermore, Figure 3 shows an example of a video 
during annotation. 


(a) First Interlocutor (b) Second Interlocutor 


Figure 1. Example of recorded video in a conversation between two interlocutors 


{
  "frames": {
    "…": [{
      "x1": …,
      "y1": 30,
      "x2": 469,
      "y2": 243,
      "id": 6,
      "width": 746,
      "height": 419,
      "type": "Rectangle",
      "tags": ["neutral"],
      "name": 1,
      "blockSuggest": true
    }],
    "…": [{
      "x1": 295,
      "y1": …,
      "x2": 471,
      "y2": 259,
      "id": …,
      "width": 746,
      "height": 419,
      "type": "Rectangle",
      "tags": ["neutral"],
      "name": 1,
      "suggestedBy": {"frameId": 1, "regionId": 0},
      "blockSuggest": true
    }],


Figure 2. Part of annotated JSON example 
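
A minimal Python sketch of how the file names and the annotation JSON might be consumed is given below. It 
assumes the layout visible in Figure 2 (a top-level "frames" object that maps a frame key to a list of regions 
with x1, y1, x2, y2, width, height, and tags fields). It further assumes that width and height describe the 
coordinate space in which each box was drawn, so that boxes need to be rescaled to the 1920x1080 video. The 
function names, the sample file name, and the sample values are hypothetical and are not taken from the 
dataset. 

```python
import json
from collections import Counter
from typing import Dict, Tuple

VIDEO_WIDTH, VIDEO_HEIGHT = 1920, 1080  # full HD resolution stated in the paper


def parse_clip_name(filename: str) -> Tuple[str, str]:
    """Split 'CONVERSATIONID_CAMERAID.mp4' into (conversation_id, camera_id)."""
    stem = filename.rsplit(".", 1)[0]  # drop the file extension, if any
    conversation_id, camera_id = stem.rsplit("_", 1)
    return conversation_id, camera_id


def tag_counts(annotation: dict) -> Counter:
    """Count how many annotated regions carry each expression tag."""
    counts: Counter = Counter()
    for regions in annotation["frames"].values():
        for region in regions:
            counts.update(region.get("tags", []))
    return counts


def scale_box(region: dict) -> Dict[str, float]:
    """Rescale a box from the annotation space to the video resolution.

    Assumption: 'width' and 'height' give the size of the space in which
    x1, y1, x2, y2 were recorded.
    """
    sx = VIDEO_WIDTH / region["width"]
    sy = VIDEO_HEIGHT / region["height"]
    return {
        "x1": region["x1"] * sx,
        "y1": region["y1"] * sy,
        "x2": region["x2"] * sx,
        "y2": region["y2"] * sy,
    }


# Hypothetical annotation with made-up values, shaped like Figure 2.
sample = json.loads("""
{"frames": {
  "0": [{"x1": 100, "y1": 50, "x2": 300, "y2": 250,
         "width": 960, "height": 540,
         "type": "Rectangle", "tags": ["neutral"]}],
  "1": [{"x1": 110, "y1": 60, "x2": 310, "y2": 260,
         "width": 960, "height": 540,
         "type": "Rectangle", "tags": ["happy"]}]
}}
""")

conversation_id, camera_id = parse_clip_name("3_1.mp4")  # hypothetical file name
counts = tag_counts(sample)
first_box = scale_box(sample["frames"]["0"][0])

print(conversation_id, camera_id)
print(dict(counts))
print(first_box)
```

Counting tags per clip in this way gives a quick view of how the sad, neutral, and happy labels are 
distributed before the annotations are used for training. 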


4. POTENTIAL APPLICATIONS 

Several potential applications can take advantage of the availability of this dataset, such as facial 
recognition and emotion detection. 

Facial Recognition. Our dataset provides additional data that may help increase the accuracy of facial 
recognition. Most research in this field uses the CK+ dataset, as done by Bartlett et al. [11], Cohen et al. 
[12], and Cohn et al. [13]. Unfortunately, the datasets currently in use mostly contain people from Western 
regions, which can strongly affect the recognition of Asian faces. 

Emotion Detection. Most other research on emotion detection uses single-modality datasets, such as 
visual-only or audio-only data [14]. Our dataset provides multi-modal information, i.e., images and audio that 
are linked together. Moreover, commonly used datasets contain posed expressions that are not based on 
authentic emotions [15]. 

Virtual Humans or Intelligent Virtual Agents. With the dataset, we can learn a natural conversation 
between two interlocutors and implement it in a virtual human [16, 17]. The dataset provides several features 
to be learned: emotion recognition from audio (i.e., voice) and video (e.g., facial expressions), natural 
language processing, and conversation. 


Figure 3. Example of a video during the annotation process. The red square denotes the annotated facial area. 

Psychology or Social Science Study. With the dataset, researchers in psychology or social science 
could also analyze and observe human behavior during interaction. An ethnographic study could also be 
conducted to analyze or observe human behavior (specifically that of Indonesian people) through the videos. 


5. CONCLUSION 

This paper presents a conversation video dataset containing videos of real conversations between pairs 
of volunteers, recorded using mobile device cameras, along with JSON annotation data. In the future, we 
intend to collect more data by recruiting more diverse volunteers in terms of age, gender, and occupation. We 
believe that this dataset will be useful for applications that require training on images, audio, or videos. 
We also want to record videos under different lighting conditions in order to observe the influence of 
lighting. 
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